EXECUTIVE SUMMARY


The availability of Business Activity Statement (BAS) data collected by the Australian 
Taxation Office (ATO) has provided the Australian Bureau of Statistics (ABS) with 
opportunities to improve the efficiency of sample design and estimation for its
business surveys. The ABS business surveys currently use two methods of estimation:
number-raised estimation and ratio estimation. While ratio estimation allows the use 
of one auxiliary variable to improve the precision of the estimates, generalised 
regression (GREG) estimation allows the use of more than one auxiliary variable, and 
hence has the potential to be more efficient (i.e. reduce the current sample sizes for 
ABS business surveys with no reduction in the precision of the estimates) than 
number-raised and ratio estimation. 


The generalised regression estimator is unbiased with respect to the assumed model. 
However, if by chance there are several units in the sample with unusually large 
residuals under the generalised regression model, then the generalised regression 
estimator may grossly underestimate or overestimate the population totals. One 
solution to this problem is to modify values outside preset cutoff values to values 
closer to these cutoff values. This estimator is called the ‘winsorized’ estimator. 
Although the winsorized estimator is biased, it may have a considerably smaller mean 
squared error than the generalised regression estimator. 


In order to minimise the mean squared error of the winsorized estimator, the choice 
of the cutoff values can be written as functions of the proposed regression model and 
the bias of the winsorized estimator. The suitability of the winsorized estimator will 
ultimately depend on the choice of the cutoff values, and hence the methods used to 
estimate the bias parameters and regression parameters used to calculate these cutoff 
values. 


The estimate of bias parameters can be calculated using the approach outlined in 
Kokic and Bell (1994). However, there are many solutions to the problem of 
estimating robust regression parameters. It is worth noting that the task is not to find 
the best robust regression model, but rather to find a robust regression fitting 
procedure which results in the best performing winsorized estimator. Furthermore, 
since this procedure will have to be used for many ABS business surveys and 
understood by a range of people, it is desirable that the procedure is simple, flexible
and easily traceable. An investigation was performed using a number of different
robust regression fitting techniques to determine which techniques resulted in the 
best performing winsorized estimator. 


A simulation study was undertaken to assess the performance of these various 
methods to estimate robust regression parameters, and hence estimate the cutoff 
values used in the winsorized estimator. The simulation study examined: 


• the ‘best’ robust regression fitting technique;

• the data used to calculate cutoff values under winsorization;

• the level to calculate bias parameters under winsorization; and

• the sample weights to calculate the cutoff values under winsorization.


One of the key findings of the study was that diagnostics should be incorporated into
the regression fitting procedure before using the regression parameters to generate
cutoff values. In particular, the diagnostics should check that the regression model fits
the data well, that units with large influence are removed from the regression model, and
that the regression model fits the current data to be winsorized.


There often exist linear relationships between the various data items collected and
derived in ABS business surveys, and it is important that these linear relationships still 
hold after winsorization. The current ABS estimation system allows the linear 
relationships to be maintained by two methods. Unfortunately, there are some 
situations where these two methods perform quite poorly. An alternative method 
which attempts to overcome the shortcomings of the two methods is suggested, 
which requires the specification of a distance function between the original and final 
winsorized values. Although any one of a number of distance functions could be used, 
the one examined in this paper is the generalised least squares distance function. 


DISCUSSION POINTS FOR MAC 


The questions for MAC members in relation to winsorization for generalised
regression estimation are:


Is the solution under linear interpolation to estimate the bias parameters better 
than taking the last positive breakpoint? 


Is it logical to apply the same regression model used for the generalised 
regression estimator, to generate the regression parameters for the winsorized 
cutoff values? Should this same model be used to form winsorized cutoff values 
for all variables collected in the survey? If a regression model is known which fits 
another variable better, then should this regression model be used to form 
winsorization cutoff values for this other variable, even though it was not used 
for the generalised regression estimator? 


Is it logical to fit a regression model at different levels to the generalised
regression model? 


What is the best way to ensure the regression model fitted to the historical data 
is still applicable to the current data to be winsorized? What action should be 
taken when the regression model does not fit the current data to be winsorized? 


What is the best way to deal with units with large historical values which have an 
adverse influence on the estimate of the bias parameter? 


Is there any reason why the design weights should not be used in the calculation 
of cutoff values used in the winsorized estimator? 


Is the concept of minimising a distance function to ensure linear relationships 
between variables still hold after winsorization appropriate? Is the generalised 
least squares distance function appropriate? 
The role of the Methodology Advisory Committee (MAC) is to review and direct research
into the collection, estimation, dissemination and analytical methodologies associated
with ABS statistics. Papers presented to the MAC are often in the early stages of 
development, and therefore do not represent the considered views of the Australian 
Bureau of Statistics or the members of the Committee. Readers interested in the 
subsequent development of a research topic are encouraged to contact either the author 
or the Australian Bureau of Statistics. 
1. INTRODUCTION


The availability of Business Activity Statement (BAS) data collected by the Australian 
Taxation Office (ATO) has provided the Australian Bureau of Statistics (ABS) with 
opportunities to improve the efficiency of sample design and estimation for its
business surveys. The ABS business surveys currently use two methods of estimation:
number-raised estimation and ratio estimation. While ratio estimation allows the use 
of one auxiliary variable to improve the precision of the estimates, generalised 
regression (GREG) estimation allows the use of more than one auxiliary variable, and 
hence has the potential to be more efficient (i.e. reduce the current sample sizes for 
ABS business surveys with no reduction in the precision of the estimates) than 
number-raised and ratio estimation. The BAS data will potentially be a rich source of 
auxiliary variables for use in GREG estimation. 


1.1 Generalised regression estimator 

Consider a finite population $U = \{1, \ldots, i, \ldots, N\}$, from which a probability sample
$s$ ($s \subset U$) is drawn according to a sample design with selection probabilities
$\pi_i = \Pr(i \in s)$. The sampling weights $w_i = 1/\pi_i$ are those used in the
Horvitz-Thompson estimator, $\hat{t}_{y\pi} = \sum_{i \in s} w_i y_i$, for variable of interest $y$. The objective
is to estimate the population total $t_y = \sum_{i \in U} y_i$, where $y_i$ is the value of the variable of
interest $y$ for unit $i$. Assume there exists a set of auxiliary variables
$x_i = (x_{1i}, \ldots, x_{ji}, \ldots, x_{Ji})^T$ for which the population totals $t_x = \sum_{i \in U} x_i$ are known. The
generalised regression estimator is given by (Särndal, Swensson and Wretman, 1992):

$$\hat{t}_{yreg} = \sum_{i \in s} w_i y_i + \Big( t_x - \sum_{i \in s} w_i x_i \Big)^T \hat{B}$$

where

$$\hat{B} = \Big[ \sum_{i \in s} w_i x_i x_i^T \Big]^{-1} \Big[ \sum_{i \in s} w_i x_i y_i \Big]$$

The generalised regression estimator is often written as:

$$\hat{t}_{yreg} = \sum_{i \in s} w_i g_i y_i$$

where $g_i$ is the g-weight for unit $i$, defined as:

$$g_i = 1 + x_i^T \Big[ \sum_{k \in s} w_k x_k x_k^T \Big]^{-1} \Big[ t_x - \sum_{k \in s} w_k x_k \Big]$$
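The two forms of the generalised regression estimator above can be checked numerically. The following is a minimal sketch only, assuming the sample data are held in NumPy arrays; the function name `greg_estimate` and its argument names are illustrative and do not come from any ABS system.

```python
import numpy as np

def greg_estimate(y, x, w, t_x):
    """Return (GREG estimate of the total, g-weights) for one sample.

    y   : (n,)   sample values of the variable of interest
    x   : (n, J) sample values of the auxiliary variables
    w   : (n,)   sampling weights w_i = 1 / pi_i
    t_x : (J,)   known population totals of the auxiliary variables
    """
    y, x, w, t_x = (np.asarray(a, dtype=float) for a in (y, x, w, t_x))
    xtwx = x.T @ (w[:, None] * x)                   # sum_i w_i x_i x_i^T
    b_hat = np.linalg.solve(xtwx, x.T @ (w * y))    # weighted regression coefficients
    ht_x = x.T @ w                                  # Horvitz-Thompson estimate of t_x
    t_greg = w @ y + (t_x - ht_x) @ b_hat           # regression form of the estimator
    g = 1.0 + x @ np.linalg.solve(xtwx, t_x - ht_x) # g-weight for each sampled unit
    return t_greg, g
```

Under these assumptions the g-weighted form reproduces the regression form, and the weights $w_i g_i$ reproduce the known auxiliary totals exactly, which is a convenient check on any implementation.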


1.2 Winsorization 


The generalised regression estimator is unbiased with respect to the assumed model. 
However, if by chance there are several units in the sample with unusually large 
residuals under the generalised regression model, then the generalised regression 
estimator may grossly underestimate or overestimate the population totals. It is 
desirable to robustify the estimator against such unusually large residuals. There are 
two distinct solutions to this problem. The first is to modify the weights associated 
with these units (Hidiroglou and Srinath, 1981), while the second is to modify the 
values of the variables of interest for these units. One approach to this second 
solution is to modify values outside preset cutoff values to values closer to these cutoff 
values. This estimator is called the ‘winsorized’ estimator (Searls, 1966). Although the 
winsorized estimator is biased, it may have a considerably smaller mean squared error 
than the generalised regression estimator. 


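Let $\hat{t}_y = \sum_{i \in s} w_i y_i$ be an unbiased estimator of the population total, under the model:

$$\mathrm{Var}(Y_i) = \sigma_i^2$$

and suppose the winsorized estimator of the population total is given by:

$$\hat{t}_{ywin} = \sum_{i \in s} w_i y_i^*$$

where the winsorized value, $y_i^*$, is calculated using the Type II winsorized estimator
(Gross, Bode, Taylor and Lloyd-Smith 1986), modified for two-sided winsorization:

$$y_i^* = \begin{cases}
\dfrac{1}{w_i} y_i + \Big(1 - \dfrac{1}{w_i}\Big) K_{Ui} & \text{if } y_i > K_{Ui} \\
y_i & \text{if } K_{Li} \le y_i \le K_{Ui} \\
\dfrac{1}{w_i} y_i + \Big(1 - \dfrac{1}{w_i}\Big) K_{Li} & \text{if } y_i < K_{Li}
\end{cases}$$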
(i.e. the outlier contributes its unweighted value, while the non-sampled units,
represented by the remainder of the weight, $w_i - 1$, contribute preset upper or lower
cutoff values, $K_{Ui}$ and $K_{Li}$, to the estimate of the population).
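The two-sided Type II rule can be sketched in a few lines. This is an illustrative sketch only: the function name `winsorize_type2` and its arguments are assumptions made for the example, and cutoffs may be supplied per unit or as scalars.

```python
import numpy as np

def winsorize_type2(y, w, k_lower, k_upper):
    """Two-sided Type II winsorized values y_i^*.

    A unit outside its cutoffs keeps 1/w_i of its own value and takes
    (1 - 1/w_i) of the cutoff, so that w_i * y_i^* = y_i + (w_i - 1) * K.
    """
    y, w, k_lower, k_upper = np.broadcast_arrays(
        *(np.asarray(a, dtype=float) for a in (y, w, k_lower, k_upper))
    )
    above = y / w + (1.0 - 1.0 / w) * k_upper   # value used when y_i > K_Ui
    below = y / w + (1.0 - 1.0 / w) * k_lower   # value used when y_i < K_Li
    return np.where(y > k_upper, above, np.where(y < k_lower, below, y))
```

For example, with $w_i = 5$, $K_{Ui} = 20$ and $y_i = 100$, the winsorized value is 36, so the weighted contribution is $5 \times 36 = 180 = 100 + 4 \times 20$.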


In order to minimise the mean squared error of the winsorized estimator under the 
model, the choice of the cutoff value (Clark 1995) is given by: 


$$K_{Ui} = \mu_i^* - \frac{B_U}{w_i - 1}$$

$$K_{Li} = \mu_i^* - \frac{B_L}{w_i - 1}$$

where $\mu_i^* = E(Y_i^*)$, and $B_U$ and $B_L$ are the bias of $\hat{t}_{ywinU}$ and $\hat{t}_{ywinL}$:

$$B_U = E(\hat{t}_{ywinU}) - t_y$$

$$B_L = E(\hat{t}_{ywinL}) - t_y$$

where $\hat{t}_{ywinU}$ is the estimate of population total when only upper winsorization is
performed and $\hat{t}_{ywinL}$ is the estimate of population total when only lower
winsorization is performed.

In practice $\mu_i^*$ is difficult to estimate. Under the assumptions that winsorization is
mild and reasonably symmetric, $\mu_i^*$ is replaced with $\mu_i = E(Y_i)$ to give approximately optimal
cutoffs:

$$K_{Ui} = \mu_i - \frac{B_U}{w_i - 1}$$

$$K_{Li} = \mu_i - \frac{B_L}{w_i - 1}$$
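As a rough illustration of the approximately optimal cutoffs, the sketch below computes $K_{Ui}$ and $K_{Li}$ from a fitted mean, a weight and the two bias parameters. The function name `approx_cutoffs` is hypothetical, and the treatment of units with $w_i = 1$ (infinite cutoffs, so that they are never winsorized) is an assumption made here, not something stated in the text.

```python
import numpy as np

def approx_cutoffs(mu_hat, w, b_upper, b_lower):
    """Cutoffs K_Ui = mu_i - B_U / (w_i - 1) and K_Li = mu_i - B_L / (w_i - 1).

    b_upper is expected to be <= 0 (upper winsorization lowers the estimate)
    and b_lower >= 0, so K_Ui lies above mu_i and K_Li lies below it.
    """
    mu_hat, w = np.broadcast_arrays(
        np.asarray(mu_hat, dtype=float), np.asarray(w, dtype=float)
    )
    with np.errstate(divide="ignore", invalid="ignore"):
        # Assumption: units with weight one get infinite cutoffs.
        k_upper = np.where(w > 1, mu_hat - b_upper / (w - 1), np.inf)
        k_lower = np.where(w > 1, mu_hat - b_lower / (w - 1), -np.inf)
    return k_lower, k_upper
```

Because the bias term is divided by $w_i - 1$, units with large weights receive cutoffs close to their fitted mean, while units with small weights are given much more room before being winsorized.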
2. CHOICE OF CUTOFF VALUES


The suitability of the winsorized estimator will ultimately depend on the choice of the 
cutoff values, and hence the methods used to estimate the bias parameters and 
regression parameters used to calculate these cutoff values. 


2.1 Estimation of bias parameters 


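The estimate of the bias parameter $B_U$ under winsorization depends on the values of the
upper cutoffs, and can be calculated using the approach outlined in Kokic and Bell
(1994). Firstly, define weighted residuals:

$$D_i = (Y_i - \mu_i)(w_i - 1)$$

and let $U = -B_U$, such that $K_{Ui} = \mu_i + \dfrac{U}{w_i - 1}$, then the upper bias parameter can be
written as:

$$\begin{aligned}
B_U(U) &= E[\hat{t}_{ywinU} - \hat{t}_y] \\
&= \sum_{i \in s} (w_i - 1)\{E[\min(Y_i, K_{Ui})] - \mu_i\} \\
&= \sum_{i \in s} E[\min\{(Y_i - \mu_i)(w_i - 1), (K_{Ui} - \mu_i)(w_i - 1)\}] \\
&= \sum_{i \in s} E[\min\{D_i, U\}] \\
&= \sum_{i \in s} E[\min\{0, U - D_i\} + D_i] \\
&= -E\Big[\sum_{i \in s} \max\{D_i - U, 0\}\Big]
\end{aligned}$$

where the last step uses $E(D_i) = 0$. The value of $B_U$ can be found by solving the equation:

$$U - E\Big[\sum_{i \in s} \max\{D_i - U, 0\}\Big] = 0$$

Let $\hat{\mu}_i$ be a robust estimate of $\mu_i$ and define $\hat{D}_i = (Y_i - \hat{\mu}_i)(w_i - 1)$. With $D_i$ replaced by
$\hat{D}_i$ and the expectation replaced by its sample value, the left-hand side of the previous
equation is piecewise linear in $U$ with breakpoints at $U = \hat{D}_i$. By setting
$\hat{D}_{(1)} \ge \hat{D}_{(2)} \ge \ldots \ge 0 \ge \ldots$ as the ordered values of $\hat{D}_i$, the values at the distinct breakpoints of the
equation can be expressed as:

$$\begin{aligned}
\psi_U(\hat{D}_{(k)}) &= \hat{D}_{(k)} - \sum_{i \in s} \max\{\hat{D}_{(i)} - \hat{D}_{(k)}, 0\} \\
&= (k + 1)\hat{D}_{(k)} - \sum_{j=1}^{k} \hat{D}_{(j)}
\end{aligned}$$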
Therefore, the optimal value of $U$ can be found by solving the equation $\psi_U(U) = 0$. In
general there will be no exact solution to this equation. The solution given by
Chambers, Kokic, Smith and Cruddas (2000) is:

$$\hat{U} = \frac{1}{k^* + 1} \sum_{j=1}^{k^*} \hat{D}_{(j)}$$

where $k^*$ is the last value of $k$ for which $\psi_U(\hat{D}_{(k)})$ is non-negative.

An alternative solution is to use linear interpolation between $\dfrac{1}{k^* + 1} \sum_{j=1}^{k^*} \hat{D}_{(j)}$ and
$\dfrac{1}{k^* + 2} \sum_{j=1}^{k^* + 1} \hat{D}_{(j)}$:

$$\hat{U} = \frac{\psi_U(\hat{D}_{(k^*)}) \dfrac{1}{k^* + 2} \sum_{j=1}^{k^* + 1} \hat{D}_{(j)} - \psi_U(\hat{D}_{(k^* + 1)}) \dfrac{1}{k^* + 1} \sum_{j=1}^{k^*} \hat{D}_{(j)}}{\psi_U(\hat{D}_{(k^*)}) - \psi_U(\hat{D}_{(k^* + 1)})}$$


If there is limited data or the extreme weighted residuals differ significantly, then the
solution under linear interpolation will produce a lower value of $U$ than taking the
last positive $\psi_U(\hat{D}_{(k)})$. Hence, the solution under linear interpolation will reduce, to
some extent, the influence of individual units on the value of $U$.
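The breakpoint search and the two candidate solutions can be sketched as follows. This is a hedged illustration: the function names are invented for the example, the interpolation weights follow one reading of the interpolation rule (the two candidate values are weighted by the $\psi_U$ values at the breakpoints on either side of zero), and returning zero when no residual is positive is an assumption.

```python
import numpy as np

def psi_at_breakpoints(d_desc):
    """psi_U(D_(k)) = (k + 1) * D_(k) - sum_{j<=k} D_(j), for k = 1..n."""
    k = np.arange(1, len(d_desc) + 1)
    return (k + 1) * d_desc - np.cumsum(d_desc)

def estimate_u(d_hat, interpolate=False):
    """Estimate U = -B_U from weighted residuals D_i = (y_i - mu_i)(w_i - 1)."""
    d = np.sort(np.asarray(d_hat, dtype=float))[::-1]   # D_(1) >= D_(2) >= ...
    psi = psi_at_breakpoints(d)
    nonneg = np.flatnonzero(psi >= 0)
    if nonneg.size == 0:
        return 0.0              # assumption: no positive residuals, nothing to winsorize
    k_star = nonneg[-1] + 1     # 1-based index of the last non-negative psi
    csum = np.cumsum(d)
    u_a = csum[k_star - 1] / (k_star + 1)               # last non-negative breakpoint solution
    if not interpolate or k_star == len(d):
        return float(u_a)
    u_b = csum[k_star] / (k_star + 2)                   # candidate using one more residual
    psi_a, psi_b = psi[k_star - 1], psi[k_star]         # psi_a >= 0 > psi_b
    return float((psi_a * u_b - psi_b * u_a) / (psi_a - psi_b))
```

With the hypothetical residuals `[10, 4, 1, -2]`, `estimate_u` gives 5 without interpolation and a slightly lower value with interpolation, in line with the remark above that interpolation reduces the influence of the single largest residual.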


QUESTION 1: Is the solution under linear interpolation to estimate the bias 
parameters better than taking the last positive breakpoint? 


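The estimate of the lower bias parameter $B_L$ under winsorization can be found in the
same way. Let $L = -B_L$ and then the lower bias parameter can be written as:

$$\begin{aligned}
B_L(L) &= E[\hat{t}_{ywinL} - \hat{t}_y] \\
&= -E\Big[\sum_{i \in s} \min\{D_i - L, 0\}\Big]
\end{aligned}$$

By setting $\hat{E}_{(1)} \le \hat{E}_{(2)} \le \ldots \le 0 \le \ldots$ as the ordered values of $\hat{D}_i$ then the distinct
breakpoints of the equation are:

$$\psi_L(\hat{E}_{(k)}) = \hat{E}_{(k)} - \sum_{i \in s} \min\{\hat{E}_{(i)} - \hat{E}_{(k)}, 0\}$$

The linear interpolation solution to the equation $\psi_L(L) = 0$ is:

$$\hat{L} = \frac{\psi_L(\hat{E}_{(k^*)}) \dfrac{1}{k^* + 2} \sum_{j=1}^{k^* + 1} \hat{E}_{(j)} - \psi_L(\hat{E}_{(k^* + 1)}) \dfrac{1}{k^* + 1} \sum_{j=1}^{k^*} \hat{E}_{(j)}}{\psi_L(\hat{E}_{(k^*)}) - \psi_L(\hat{E}_{(k^* + 1)})}$$

where $k^*$ is the last value of $k$ for which $\psi_L(\hat{E}_{(k)})$ is non-positive.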


The upper and lower bias parameter estimates, $\hat{B}_U$ and $\hat{B}_L$, depend on the values of
the top and bottom weighted residuals $\hat{D}_{(1)}, \hat{D}_{(2)}, \ldots, \hat{D}_{(k^* + 1)}$ and $\hat{E}_{(1)}, \hat{E}_{(2)}, \ldots, \hat{E}_{(k^* + 1)}$. If
the data used to generate the cutoff values is the same as the data for which the
winsorized cutoff values are to be applied, then it is those units with the values of the
top and bottom weighted residuals that will be winsorized. In this case the estimated
bias, $\hat{B}_U + \hat{B}_L$, will be realised. However, if the data used to generate the cutoff values
is different then it is assumed that data for which the winsorized cutoff values are to
be applied fits the same model as the data used to generate the cutoff values. If this
assumption holds then the realised bias should be approximately $\hat{B}_U + \hat{B}_L$.


2.2 Estimate of robust regression parameters 


Suppose the generalised regression estimator is based on the model:

$$\mu_i = \beta^T x_i$$

where $x_i = (x_{1i}, \ldots, x_{ji}, \ldots, x_{Ji})^T$ is a set of auxiliary variables for which the population
totals are known. It would appear logical to apply this same model to generate $\hat{\mu}_i$, to
be used as a robust estimate of the parameter $\mu_i$, as well as in the estimation of the
upper and lower bias parameters, $B_U$ and $B_L$.
QUESTION 2: Is it logical to apply the same regression model used for the
generalised regression estimator, to generate the regression parameters for the 
winsorized cutoff values? Should this same model be used to form winsorized 
cutoff values for all variables collected in the survey? If a regression model is 
known which fits another variable better, then should this regression model be 
used to form winsorization cutoff values for this other variable, even though it was 
not used for the generalised regression estimator? 


There are many solutions to the problem of estimating robust regression parameters. 
It is worth noting that the task is not to find the best robust regression model, but 
rather to find a robust regression fitting procedure which results in the best 
performing winsorized estimator. Furthermore, since this procedure will have to be 
used for many ABS business surveys and understood by a range of people, it is 
desirable that the procedure is simple, flexible and easily traceable. An investigation 
was performed using a number of different robust regression fitting techniques to 
determine which method resulted in the best performing winsorized estimator. The 
techniques are listed in the following paragraphs with the results of the investigation 
presented in Section 2.3. 


2.2.1 Trimmed least squares 


The Trimmed Least Squares (TLS) technique consisted of fitting an Ordinary Least 
Squares (OLS) regression model to minimise the function: 


$$\sum_{i \in s} (y_i - \beta^T x_i)^2$$


The residuals were calculated by applying the regression model back to the data used 
to fit the model. A percentage of the units with the largest positive and negative 
residuals were then removed from the data. A second regression model was then 
fitted to the reduced data, to estimate the robust regression parameters. The 
percentage of units removed and actual method of removing these units was varied in 
the investigation. The TLS technique has the advantage that it is extremely quick to 
run, simple to understand and easy to trace. Standard regression diagnostics can be 
generated from the fit of the regression model. 
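The two-pass fit described above can be sketched as below. This is only one possible reading: the function name `trimmed_least_squares` is hypothetical, residuals are unweighted, and the trimmed percentage is split equally between the largest positive and largest negative residuals, whereas the investigation varied the method of removal.

```python
import numpy as np

def trimmed_least_squares(y, x, trim_fraction=0.05):
    """Fit OLS, drop the units with the most extreme residuals, then refit.

    x must already contain a column of ones if an intercept is wanted.
    Returns the second-pass coefficients and the indices of the units kept.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    beta_first, *_ = np.linalg.lstsq(x, y, rcond=None)   # first-pass OLS fit
    resid = y - x @ beta_first                           # residuals on the fitting data
    n_tail = int(round(trim_fraction * n / 2))           # units removed from each tail
    order = np.argsort(resid)                            # most negative residual first
    keep = np.sort(order[n_tail:n - n_tail])
    beta_robust, *_ = np.linalg.lstsq(x[keep], y[keep], rcond=None)
    return beta_robust, keep
```

The fitted values $\hat{\mu}_i = \hat{\beta}^T x_i$ from the second pass would then feed into the weighted residuals used for the bias parameters.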
2.2.2 Trimmed least absolute value or $L^1$ regression technique


The Trimmed Least Absolute Value (LAV) or $L^1$ Regression Technique consisted of
fitting a regression model to minimise the function:

$$\sum_{i \in s} |y_i - \beta^T x_i|$$


The residuals were calculated by applying the regression model back to the data used 
to fit the model. A percentage of the units with the largest positive and negative 
residuals were then removed from the data. A second regression model was then 
fitted to the reduced data, to estimate the robust regression parameters. The 
percentage of units removed and actual method of removing these units was varied in 
the investigation. 


This technique is very similar to the current method used to perform winsorization in 
ABS business surveys using ratio estimation. The difference is that the current 
method truncates the data to a percentile value (10% and 90%) rather than removing 
the data. The LAV technique should result in a more robust regression model than 
the TLS technique because large residuals have less influence on the regression 
parameters, since the residuals are not squared. 


2.2.3 Sample splitting technique 


The Sample Splitting (SS) Technique consists of fitting an OLS regression model after 
the data has been randomly split into two halves. The ‘residuals’ were calculated by 
applying the regression model back to the half of the data not used to fit the model. 
The units with the largest positive and negative residuals were then removed from the 
data after the two halves were then merged back together. This process was repeated 
until a percentage of the units had been removed. The SS technique should result in 
a more robust regression model than the TLS technique because the residuals used to 
remove the ‘outlier’ units are not calculated from a regression model that has been 
generated using these ‘outlier’ units. 


2.2.4 Least median of squares 


The Least Median of Squares (LMS) technique, described by Rousseeuw and Leroy 
(1987), consisted of minimising the median of all sample squared residuals. The LMS 
regression parameters cannot be found analytically, so a resampling technique similar 
to the bootstrap is applied to find an approximate solution. The LMS technique is 
approximated by calculating the median of squared residuals of many trial regression 
parameters, and then selecting the regression parameters with the smallest median of 
squared residuals. The LMS technique should result in a more robust regression 
model than the TLS technique because it has the effect of fitting an OLS regression 
model in the absence of ‘outlier’ units, without totally removing these ‘outlier’ units.
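A resampling approximation of this kind can be sketched as follows. The sketch assumes that each trial fit is an exact fit through a randomly chosen subset of as many units as there are regression parameters; this is a common way of generating trial parameters but is an assumption here, as are the function name `lms_fit`, the number of trials and the fixed seed.

```python
import numpy as np

def lms_fit(y, x, n_trials=500, seed=0):
    """Approximate Least Median of Squares fit by random resampling.

    Each trial solves the regression exactly on p randomly chosen units
    and scores the result by the median squared residual over all units.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    rng = np.random.default_rng(seed)
    best_beta, best_crit = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)   # random elemental subset
        try:
            beta = np.linalg.solve(x[idx], y[idx])   # exact fit through the subset
        except np.linalg.LinAlgError:
            continue                                 # skip singular subsets
        crit = np.median((y - x @ beta) ** 2)        # LMS criterion for this trial
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta, best_crit
```

Because the criterion ignores the largest half of the squared residuals, the selected parameters describe the core of the data and can fit the tails poorly, which is relevant when the residuals are later used to set cutoffs.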
2.3 Simulation study


A simulation study was undertaken to assess the performance of these various 
techniques to estimate robust regression parameters, and hence estimate the cutoff 
values used in the winsorized estimator. The simulation study examined: 


• the ‘best’ robust regression fitting technique (Section 2.3.1)

• the data used to calculate cutoff values under winsorization (Section 2.3.2)

• the level to calculate bias parameters under winsorization (Section 2.3.3)

• the sample weights to calculate the cutoff values under winsorization (Section
2.3.4)


The simulation study was performed using a survey population of approximately 
700,000 units, based on the survey frame used for the Quarterly Economic Activity 
Survey (QEAS). QEAS uses a stratified random sample design with strata defined by 
the variables state, industry and employment size. The total sample size is 
approximately 16,400. The reported QEAS sales variable was used as the response 
variable for the study. BAS wages and BAS turnover values were merged to the frame, 
to be used as the auxiliary variables. For the non-sampled units on the frame a QEAS
sales value was generated using a regression model involving BAS wages, BAS 
turnover, frame employment and an error term: 


$$y_i = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \hat{\beta}_3 x_{3i} + \varepsilon_i$$


where


$y_i$ is the predicted QEAS Sales for unit $i$

$x_{1i}$ is the BAS Wages for unit $i$

$x_{2i}$ is the BAS Turnover for unit $i$

$x_{3i}$ is the Business Register Employment for unit $i$

$\hat{\alpha}$ is the intercept parameter from fitting the model on the sampled units

$\hat{\beta}_1$ is the BAS Wages parameter from fitting the model on the sampled units

$\hat{\beta}_2$ is the BAS Turnover parameter from fitting the model on the sampled units

$\hat{\beta}_3$ is the Business Register Employment parameter from fitting the model
on the sampled units

$\varepsilon_i$ is random noise for unit $i$ from $N(0, \hat{\sigma}_i^2)$

$\hat{\sigma}_i^2$ is the variance of the predicted value for unit $i$
The regression model was fit at stratum level wherever there were sufficient sampled
units. Where there were fewer than five responding units in a stratum the QEAS sales
value was generated from a model fit at employment size level.


2.3.1 ‘Best’ robust regression fitting technique 


The simulation study consisted of selecting three independent stratified random 
samples to generate cutoff values under the various robust regression fitting 
techniques, and then applying these cutoff values to another independent stratified
random sample to calculate the winsorized estimator, $\hat{t}_{ywin}$. This process was
repeated a large number of times, $R$. The measures used to assess the performance of
the various robust regression fitting techniques were the Mean Squared Error (MSE) 
and the bias: 


$$MSE(\hat{t}_{ywin}) = \frac{1}{R} \sum_{r=1}^{R} (\hat{t}_{ywin,r} - t_y)^2$$

$$Bias(\hat{t}_{ywin}) = \frac{1}{R} \sum_{r=1}^{R} (\hat{t}_{ywin,r} - t_y)$$

where $\hat{t}_{ywin,r}$ is the winsorized estimator for the r-th simulated sample selected from
the population, and $t_y$ is the known population total.


The MSE and bias under the various robust regression fitting techniques were 
compared with the MSE and bias of the ‘unwinsorized’ estimator: 


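$$MSE(\hat{t}_y) = \frac{1}{R} \sum_{r=1}^{R} (\hat{t}_{y,r} - t_y)^2$$

$$Bias(\hat{t}_y) = \frac{1}{R} \sum_{r=1}^{R} (\hat{t}_{y,r} - t_y)$$


The percentage reduction in MSE for the various robust regression fitting techniques
was calculated as:

$$\text{MSE Reduction} = \frac{MSE(\hat{t}_{ywin}) - MSE(\hat{t}_y)}{MSE(\hat{t}_y)} \times 100\%$$

$$\text{Bias Reduction} = \frac{Bias(\hat{t}_{ywin}) - Bias(\hat{t}_y)}{Bias(\hat{t}_y)} \times 100\%$$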

and the methods with the largest percentage reduction in MSE and smallest 
percentage bias were considered the ‘best’. 
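The performance measures can be computed directly from the simulated estimates. The sketch below is illustrative only: the function names are invented, and the inputs are assumed to be arrays holding one estimate per simulated sample together with the known population total.

```python
import numpy as np

def mse_and_bias(estimates, t_y):
    """Simulation MSE and bias of an estimator over R repeated samples."""
    err = np.asarray(estimates, dtype=float) - t_y   # estimation error per repetition
    return float(np.mean(err ** 2)), float(np.mean(err))

def percentage_change(winsorized_value, unwinsorized_value):
    """Relative change of a measure (MSE or bias) against the unwinsorized
    estimator, in per cent; a negative value is a reduction."""
    return 100.0 * (winsorized_value - unwinsorized_value) / unwinsorized_value
```

With this sign convention a negative MSE figure indicates that winsorization reduced the MSE and a positive figure indicates an increase, which is how the results in Table 1 read.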
The regression parameters were calculated at the industry level, using an intercept
term and two auxiliary variables, BAS wages and BAS turnover. For the TLS and SS
methods 5% of units were removed. The regression models were fit at the industry
level since it appeared logical to fit the regression model at the same level as the
generalised regression model. There were major problems associated with the LAV
technique when several auxiliary variables were attempted, because the matrices in
the minimisation problems were frequently ill-conditioned. Therefore, the results for
the LAV technique are based on a single auxiliary variable, BAS turnover. The
percentage reduction in MSE and percentage bias for the various robust regression
fitting techniques are presented in Table 1. These results were generated using 200
repetitions (R = 200).


Table 1: Percentage reduction in MSE and bias of winsorized estimates for various techniques


Robust regression
fitting technique     MSE reduction (%)     Bias (%)
TLS                         -57.93            -0.31
LAV                         -29.62            -1.14
SS                          -58.46            -0.34
LMS                          17.37            -1.81


The LMS technique performed quite poorly, with an increase in MSE, due to a larger 
bias. The explanation for this poor performance is that while this technique results in 
a very good model for the core of the data, it can result in a very poor model for the 
tails of the distribution. Since most distributions in ABS business surveys are positively 
skewed (i.e. large upper tails), this method is more likely to result in very large values 
of $\hat{D}_i$ for units in the upper tails of the distribution, where the majority of ‘outlier’
units are located. Hence this technique has the potential to result in very high bias 
parameters. 


While the LAV technique did not perform as well as the TLS and SS techniques, this 
was primarily due to the fact that the LAV technique was based on a single auxiliary 
variable. Indeed, the performance of the LAV technique was similar to the TLS and SS 
techniques based on single auxiliary variables. The SS and TLS techniques performed 
very well, with the exception of several industries. The percentage reduction in MSE 
and percentage bias for the various robust regression fitting techniques at the industry 
level are presented in the Attachment. 


In light of the poor performance in industries 18, 30 and 35 some further investigation
was undertaken into the level of fit of the regression model, the handling of influential
units and the percentage of units to be removed. While these three industries
performed much better when the regression parameters were calculated at the
stratum level rather than the industry level, most other industries performed slightly
worse. This suggests that there might be a need to fit the regression model at
different levels to the generalised regression model.


QUESTION 3: Is it logical to fit a regression model at different levels to the 
generalised regression model? 


Generally speaking, the greater the percentage of units removed from the fit of the
regression model, the more robust the regression parameters. However, caution
should be taken not to remove too many units, as this can lead to excessively large bias
parameters. An investigation was undertaken into the impact on the MSE and bias of
the various methods when the percentage of units to be removed was varied. This 
investigation found that the optimal percentage of units to be removed varied across 
industries. Furthermore, there was no conclusive evidence to suggest whether it was 
better to remove units based on weighted or unweighted residuals. 


The investigations also found that there were some industries where the regression
model fitted to the three historical samples was very different from the regression
model fitted to the current data. This indicates that diagnostics should be incorporated
into the regression fitting procedure before using the regression parameters to
generate cutoff values. In particular, the diagnostics should check that the regression
model fits the data well, that units with large influence are removed from the regression
model, and that the regression model fits the current data to be winsorized.


QUESTION 4: What is the best way to ensure the regression model fitted to the
historical data is still applicable to the current data to be winsorized? What action
should be taken when the regression model does not fit the current data to be
winsorized?


2.3.2 Data used to calculate cutoff values 


The current practice for ABS business surveys is to use several cycles of historical data 
to estimate the regression parameters and bias parameter, and assume that the same 
regression model holds for the current data. This practice was based on work by 
Clark (1995) who suggested that the quality of parameters can be improved by using
more data. The results of the simulation study support these findings, showing a
much tighter distribution of the bias parameter when more cycles of the data are
used.


An exception to this practice is the monthly Retail Trade Survey which is affected by 
seasonal variation. In this situation it is more effective to use a single cycle of 
historical data, the corresponding month in the previous year, to estimate the 
parameters. There is a need for further investigation into weighting the various cycles 
of historical data to maximise the stability of parameters over time, in order to 
minimise the impact on the movement estimates. 


Several ABS business surveys have experienced problems with individual units with
large historical values that adversely influence the estimate of the bias parameter. It
can be seen that if the largest weighted residual, $\hat{D}_{(1)}$, is more than double the second
largest weighted residual, $\hat{D}_{(2)}$, then the bias parameter calculation algorithm will
stop after the second breakpoint, $\psi_U(\hat{D}_{(2)}) = 2\hat{D}_{(2)} - \hat{D}_{(1)} < 0$. The estimate of the
bias parameter generated will be an interpolation between $\dfrac{\hat{D}_{(1)}}{2}$ and $\dfrac{\hat{D}_{(1)} + \hat{D}_{(2)}}{3}$. If
the unit with the largest weighted residual was not present or was smaller then the
estimate may be quite different.
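As a purely hypothetical numerical illustration (the values are invented, and the interpolation weights follow one reading of the interpolation rule in Section 2.1), suppose the two largest weighted residuals are 10 and 4:

```latex
\begin{aligned}
\hat{D}_{(1)} &= 10, \qquad \hat{D}_{(2)} = 4 \\
\psi_U(\hat{D}_{(1)}) &= 2\hat{D}_{(1)} - \hat{D}_{(1)} = 10 \ge 0 \\
\psi_U(\hat{D}_{(2)}) &= 2\hat{D}_{(2)} - \hat{D}_{(1)} = 8 - 10 = -2 < 0
  \quad\Rightarrow\quad k^* = 1 \\
\frac{\hat{D}_{(1)}}{2} &= 5, \qquad
\frac{\hat{D}_{(1)} + \hat{D}_{(2)}}{3} = \frac{14}{3} \approx 4.67 \\
\hat{U} &= \frac{10 \times \tfrac{14}{3} - (-2) \times 5}{10 - (-2)}
        = \frac{85}{18} \approx 4.72
\end{aligned}
```

In this illustration the estimate is driven almost entirely by the single largest residual: both candidate values, and hence the interpolated value, are dominated by $\hat{D}_{(1)}$.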


Another problem with these large historical values is that they can make the bias 
parameters unstable over time, and hence result in large impacts on movement 
estimates. The current practice is to remove units with large historical values which 
have an adverse influence on the estimate of the bias parameter. This is justified by 
assuming the units come from a different population entirely and so including them 
does not add any information about the tail of the distribution of interest. These units 
should be removed with caution however, otherwise it could lead to cutoff values that 
are too small. 


QUESTION 5: What is the best way to deal with units with large historical values 
which have an adverse influence on the estimate of the bias parameter? 


2.3.3 Level to calculate bias parameters


The level at which the bias parameters, $B_U$ and $B_L$, are calculated will determine the
performance of estimates at the various levels. If the bias parameters are calculated at 
a broad level (e.g. Australia or industry levels), then these broad level estimates should 
perform well, but the finer level estimates may have large variances, as ‘outlier’ units 
may be undetected at these levels. On the other hand, if the bias parameters are 
calculated at finer levels, then these fine level estimates should perform well, but the
broad level estimates may exhibit large biases, as too many units may be winsorized at 
these levels. 


The solution suggested by Chambers, Kokic, Smith and Cruddas (2000), and currently
implemented for ABS business surveys, is to calculate the bias parameters at the most
important level. This treatment could well produce poor quality estimates at the
finer levels. In practice any further units that adversely affect the estimates at the finer
levels have usually been made surprise outliers (i.e. had their weight set to one) to
overcome this problem. It is expected that some surprise outliering will always be
required regardless of the winsorization methodology, although a compromise
solution, to calculate the bias parameters at an artificial intermediate level, has 
promise for reducing the extent of surprise outliering required. 


An alternative approach suggested for this problem is to calculate bias parameters at
broad and fine levels. The fine level estimates would then be modified using rescaling
factors such that they are consistent with the broad level estimates. Although this
method has its merits, it has several disadvantages. Firstly, under this approach either
the unit record data will no longer add to published estimates; or the rescaling factors
will need to be applied to all units in the survey. Secondly, this approach will become
very complex where there are a large number of data items or relationships between
the data items.


2.3.4 Sample weights used to calculate cutoff values 


The estimates of the regression parameters and the bias parameters, $B_U$ and $B_L$, are
usually generated from historical data and hence are treated as independent from the
current data. However, the cutoff values do depend on the current data through the
sample weights, $w_i$ (i.e. generalised regression weights under generalised regression
estimation). The use of generalised regression weights to calculate the cutoff values
means that the generalised regression weights need to be available to perform
winsorization. On the other hand, the use of design weights to calculate the cutoff
values has the advantage that the cutoff values can be generated in advance of the
generalised regression weights to allow sufficient time for quality checking.
Furthermore, the use of design weights also simplifies variance estimation under the
bootstrap methodology, since the same winsorized values can be used in all replicate
samples, rather than being calculated separately for each replicate sample. Therefore,
an investigation was undertaken into the performance of the winsorized estimator
using the design weight, $w_i = 1/\pi_i$, in place of the generalised regression weight,
to generate the cutoff values.
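
A small sketch may help show why the choice of weight matters little when the two weights are close. It assumes an upper cutoff of the form $K_i = \mu_i + B/(w_i - 1)$, where $\mu_i$ is the fitted value and $B$ the bias parameter; this form, the function name `upper_cutoff` and all numbers are assumptions made for illustration only.

```python
def upper_cutoff(fitted_value, bias_parameter, weight):
    """Assumed cutoff form: K_i = mu_i + B / (w_i - 1)."""
    if weight <= 1.0:
        # A unit that represents only itself is never winsorized.
        return float("inf")
    return fitted_value + bias_parameter / (weight - 1.0)


# Hypothetical unit: fitted value 100, bias parameter 500.
design_weight = 20.0  # 1 / selection probability
greg_weight = 22.0    # design weight after regression adjustment

k_design = upper_cutoff(100.0, 500.0, design_weight)
k_greg = upper_cutoff(100.0, 500.0, greg_weight)
relative_difference = abs(k_design - k_greg) / k_greg * 100.0
print(round(k_design, 2), round(k_greg, 2), round(relative_difference, 2))
```

In this hypothetical case a 10% difference between the weights moves the cutoff by roughly 2%. The design-weight version needs only the selection probabilities, so it can be computed before the generalised regression weights exist.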


The investigation used three independent samples to generate cutoff values based on
the TLS technique, with 5% of units removed. The regression parameters were calculated
at the industry level, using two auxiliary variables, BAS wages and BAS turnover. These
cutoff values were then applied to a fourth independent sample. This process was
repeated 200 times. The relative differences between the winsorized estimates when
cutoff values are calculated using the generalised regression weights and design
weights are presented in Table 2.


Table 2: Relative difference between Winsorized estimates
using generalised regression weight and design weight


Difference between          Percentage
Winsorized Estimates        of Units
0-0.5 %                       88.5
0.5-1 %                        9.0
1-3 %                          2.5
3-5 %                          0.0
5-10 %                         0.0
> 10 %                         0.0


At the Australia level, 88.5% of independent samples resulted in less than 0.5% 
difference between the winsorized estimates using the generalised regression weights 
and design weights. At the industry level, most estimates differed by less than 3.0%. 
Those industries with the larger differences (i.e. Industries 18, 26, 30 and 35) have 
already been identified as having questionable regression models throughout the 
simulation study. The relative differences between the winsorized estimates using the 
generalised regression weights and design weights at the industry level are presented 
in Attachment 1. 
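
The tabulation behind Table 2 can be sketched as follows. The sketch assumes the relative difference is measured against the estimate based on generalised regression weights and that each band includes its lower bound; both conventions, the function name `band_percentages` and the four replicate values are assumptions made for illustration.

```python
from collections import Counter

# Bands as in Table 2: (lower bound, upper bound, label), in percent.
BANDS = [
    (0.0, 0.5, "0-0.5 %"),
    (0.5, 1.0, "0.5-1 %"),
    (1.0, 3.0, "1-3 %"),
    (3.0, 5.0, "3-5 %"),
    (5.0, 10.0, "5-10 %"),
    (10.0, float("inf"), "> 10 %"),
]


def band_percentages(greg_estimates, design_estimates):
    """Percentage of replicates whose relative difference falls in each band."""
    counts = Counter()
    for greg, design in zip(greg_estimates, design_estimates):
        relative = abs(design - greg) / abs(greg) * 100.0
        for lower, upper, label in BANDS:
            if lower <= relative < upper:
                counts[label] += 1
                break
    total = len(greg_estimates)
    return {label: 100.0 * counts[label] / total for _, _, label in BANDS}


# Four hypothetical replicates (the study itself used 200).
result = band_percentages([100.0] * 4, [100.2, 100.7, 102.0, 100.1])
print(result)
```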


QUESTION 6: Is there any reason why the design weights should not be used in 
the calculation of cutoff values used in the winsorized estimator? 
3. LINEAR RELATED ITEMS


Most ABS business surveys collect and derive a wide range of data items.
Furthermore, there often exist linear relationships between these data items. For
example, suppose a survey collects a set of variables, $y_1, y_2, \ldots, y_K$, and a derived
variable, $y_0$, is calculated as a linear combination of these variables,
$y_0 = \sum_{k=1}^{K} a_k y_k$. In this situation there exists the following linear
relationship between the variables, $\sum_{k=0}^{K} a_k y_{ki} = 0$, where $a_0 = -1$. It is
important that these linear relationships still hold after winsorization. Therefore, an
important issue for winsorization is to develop a method to maintain the linear
relationships between the variables.


In theory, winsorization can be applied to all the survey variables. However, in most 
cases the linear relationships between these variables will no longer hold after 
winsorization. The current ABS estimation system allows the linear relationships to be 
maintained by either: 


• winsorizing the set of component variables, $y_1, y_2, \ldots, y_K$, and then calculating
the ‘winsorized’ derived variable based on these winsorized component
variables, $y_{0i}^{*} = \sum_{k=1}^{K} a_k y_{ki}^{*}$; or

• winsorizing the derived variable, $y_{0i}$, and then calculating the ‘winsorized’ set of
component variables by applying the same proportional adjustment,
$y_{ki}^{*} = \frac{y_{0i}^{*}}{y_{0i}} y_{ki}$.


Unfortunately, there are some situations where these two methods perform quite 
poorly. The major problem with the first of these methods is that it could result in 
very poor values for the ‘winsorized’ derived variables, in particular where some of the
$a_k$ are negative. For example, suppose profits $y_0$ is derived based on total income
$y_1$ minus total expenses $y_2$ plus opening stocks $y_3$ minus closing stocks $y_4$ (i.e.
$y_0 = y_1 - y_2 + y_3 - y_4$). Suppose a unit has an unusually large total expenses, but total
income, opening stocks and closing stocks are not unusual. Furthermore, suppose 
the derived profit for the unit is negative. Using the first of the methods under the 
current ABS estimation system, the ‘winsorized’ derived profit for the unit could easily 
end up positive, since only total expenses is winsorized. This treatment could well 
have a detrimental impact on the sign of the estimates of profit from the survey. 
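
The sign change described above can be reproduced with a toy unit. This is a minimal sketch: winsorization is simplified to pulling a value back to its cutoff, and the component values, the cutoffs and the function names are hypothetical.

```python
# Coefficients of the profit relationship given in the text.
COEFFS = {"income": 1, "expenses": -1, "opening_stocks": 1, "closing_stocks": -1}


def derive_profit(components):
    """Derived variable computed as a linear combination of components."""
    return sum(COEFFS[name] * components[name] for name in COEFFS)


def winsorize_components(components, cutoffs):
    # Simplification: a value above its cutoff is replaced by the cutoff.
    return {name: min(value, cutoffs[name]) for name, value in components.items()}


# Hypothetical unit with unusually large expenses and a negative profit.
unit = {"income": 500, "expenses": 900, "opening_stocks": 50, "closing_stocks": 60}
cutoffs = {"income": 800, "expenses": 450, "opening_stocks": 100, "closing_stocks": 100}

winsorized = winsorize_components(unit, cutoffs)
original_profit = derive_profit(unit)
winsorized_profit = derive_profit(winsorized)
print(original_profit, winsorized_profit)
```

Only expenses exceeds its cutoff, so the derived profit moves from -410 to 40 and changes sign.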


The major problem with the second of these methods is that it could result in very 
poor values for the ‘winsorized’ set of component variables, in particular where some 
of the components of a derived variable are usually much smaller than other 
components. For example, suppose total income is derived based on a number of
variables, including sales income (generally a large component of total income) and 
royalties income (generally a small component of total income). However, suppose a 
unit has an unusually large royalties income, but total income is not unusual. Using 
the second of the methods under the current ABS system, the royalties for the unit 
will not be winsorized, since total income is not winsorized. This treatment could well 
produce poor quality estimates of royalties from the survey. In practice these units 
have usually been made surprise outliers (i.e. had their weight set to one) to 
overcome this problem. Another problem with this method is that it can be quite
cumbersome to maintain multiple linear relationships between the variables.
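
The weakness of the proportional adjustment can be seen in a short sketch. As before, winsorization is simplified to capping the derived total at a cutoff; the units, the cutoff and the function name `proportional_adjustment` are hypothetical.

```python
def proportional_adjustment(components, total_cutoff):
    """Winsorize the derived total, then scale every component by the same ratio."""
    total = sum(components.values())
    winsorized_total = min(total, total_cutoff)
    ratio = winsorized_total / total if total else 1.0
    return {name: value * ratio for name, value in components.items()}


# Unit A: royalties are unusually large, but total income is not.
unit_a = {"sales": 300.0, "royalties": 200.0, "other": 20.0}
adjusted_a = proportional_adjustment(unit_a, total_cutoff=1000.0)

# Unit B: total income is unusually large because of sales alone.
unit_b = {"sales": 1500.0, "royalties": 10.0, "other": 90.0}
adjusted_b = proportional_adjustment(unit_b, total_cutoff=1000.0)

print(adjusted_a["royalties"], adjusted_b["royalties"])
```

Unit A keeps its large royalties value untouched, while unit B has an ordinary royalties value scaled down from 10 to 6.25 because of its sales.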


Chambers, Kokic, Smith and Cruddas (2000) suggested an alternative to the second 
method, which distributes the difference between the original and winsorized derived 
variable amongst the largest component variables. This method is based on the 
principle that an outlier on the derived variables will generally be due to one or 
several of the component variables being unusually large, rather than all the 
component variables. Let $y_{(1)i} \geq y_{(2)i} \geq \ldots \geq y_{(K)i}$ denote the ordered set of component
variables with coefficients $a_{(1)}, a_{(2)}, \ldots, a_{(K)}$, then the ‘winsorized’ set of component
variables are computed using the equation:

$$y_{(j)i}^{*} = \begin{cases} \dfrac{\sum_{k=1}^{J^{*}} a_{(k)} y_{(k)i} + y_{0i}^{*} - y_{0i}}{\sum_{k=1}^{J^{*}} a_{(k)}} & \text{if } j \leq J^{*} \\ y_{(j)i} & \text{otherwise} \end{cases}$$

where $J^{*}$ is the largest value of $J$ for which

$$y_{0i} - y_{0i}^{*} - \sum_{k=1}^{J} a_{(k)} y_{(k)i} + y_{(J)i} \sum_{k=1}^{J} a_{(k)} \geq 0.$$

It should be noted this method cannot be applied in its current form in the situation
where some of the $a_k$ are negative. However, it is relatively simple to modify this
method to be appropriate for this situation, and where there are two-sided winsorized 
cutoff values. While this method has its merits, it suffers from the same problems as 
the second of the methods under the current ABS estimation system. 
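
The idea of distributing the adjustment amongst the largest components can be sketched as below. The sketch assumes positive coefficients and a downward (one-sided) adjustment of the derived variable, and sets the largest components to a common level chosen so that the relationship still holds; the function name and example values are hypothetical.

```python
def distribute_adjustment(components, coeffs, winsorized_derived):
    """Pull the largest components down to a common level so that
    sum(a_k * y_k) equals the winsorized derived value."""
    if any(a <= 0 for a in coeffs):
        raise ValueError("this sketch assumes positive coefficients")
    derived = sum(a * y for a, y in zip(coeffs, components))
    if winsorized_derived > derived:
        raise ValueError("this sketch handles downward adjustments only")

    # Indices of the components ordered from largest to smallest value.
    order = sorted(range(len(components)), key=lambda k: components[k], reverse=True)
    sum_ay, sum_a = 0.0, 0.0
    j_star, level = 0, None
    for j, k in enumerate(order, start=1):
        sum_ay += coeffs[k] * components[k]
        sum_a += coeffs[k]
        candidate = (sum_ay + winsorized_derived - derived) / sum_a
        # Keep the largest J whose J-th largest component is still
        # at or above the common level.
        if components[k] >= candidate:
            j_star, level = j, candidate

    result = list(components)
    for k in order[:j_star]:
        result[k] = level
    return result


mild = distribute_adjustment([100, 40, 10], [1, 1, 1], 120)
strong = distribute_adjustment([100, 40, 10], [1, 1, 1], 80)
print(mild, strong)
```

With a mild adjustment only the largest component changes (100 becomes 70); with a stronger one the two largest are both set to 35, so the ordering of the components is preserved.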


Another alternative method which attempts to overcome the shortcomings of the two 
methods under the current ABS estimation system is to winsorize all the survey 
variables and then modify these winsorized values, so that the linear relationships 
between these variables still hold, by a process known as calibration. A new set of
winsorized values for variable $k$ for unit $i$, $y_{ki}^{**}$, are sought which lie as close as possible
to the set of original winsorized values, $y_{ki}^{*}$. The calibration requires the specification
of a distance function between the original and final winsorized values.


Although any one of a number of distance functions could be used, one of the most 
commonly used is the generalised least squares distance function: 


where $c_k$ are specified positive factors that control the relative importance of the
variables.


QUESTION 7: Is the concept of minimising a distance function to ensure that linear
relationships between variables still hold after winsorization appropriate? Is the
generalised least squares distance function appropriate?


Minimisation of the generalised least squares distance function using Lagrange
multipliers, subject to satisfying the linear relationship constraint,
$\sum_{k=0}^{K} a_k y_{ki}^{**} = 0$, leads to the final winsorized values:


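[Equation not legible in the source: it gives the final winsorized values, with one
expression for non-negative values and another for negative values.]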


This method can easily be extended to multiple linear relationship constraints (i.e.
several constraints of the form $\sum_{k=0}^{K} a_k y_{ki}^{**} = 0$). One disadvantage of
this method is that some of the final winsorized values can be negative for variables
which should always be positive (and vice versa).


This problem can be overcome by imposing range restrictions on the final winsorized
values, $L \leq y_{ki}^{**} \leq U$, where $L$ and $U$ are suitable lower and upper bounds. In order to
satisfy the linear relationship constraints and the range restrictions, the calculation of
the final winsorized values needs to be undertaken using an iterative method.


The first of the methods under the current ABS estimation system is a special case of 
this alternative method. If the factors for the set of component variables are all set to
infinity (i.e. $c_1 = c_2 = \ldots = c_K = \infty$) and $a_0 = -1$ then the final winsorized values
simplify to:

$$y_{ki}^{**} = \begin{cases} y_{ki}^{*} & \text{for } k = 1, 2, \ldots, K \\ \sum_{j=1}^{K} a_j y_{ji}^{*} & \text{for } k = 0 \end{cases}$$


which is equivalent to the first method under the current ABS estimation system. On 
the other hand, the second of the methods under the current ABS estimation system 
is not a special case of this alternative method. 
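
The calibration step and its special case can be sketched numerically. The sketch assumes the simplest weighted least squares distance, $\sum_k c_k (y_{ki}^{**} - y_{ki}^{*})^2$, which may differ from the exact distance function intended in this paper (for example, by a scaling of each term). Under that assumption the Lagrange solution is $y_{ki}^{**} = y_{ki}^{*} - (a_k / c_k) \lambda$ with $\lambda = \sum_j a_j y_{ji}^{*} / \sum_j (a_j^2 / c_j)$. The function name `calibrate` and the values are hypothetical.

```python
def calibrate(winsorized_values, coeffs, factors):
    """Closest values (assumed weighted least squares distance) that
    satisfy the single constraint sum(a_k * y_k) = 0."""
    residual = sum(a * y for a, y in zip(coeffs, winsorized_values))
    denominator = sum(a * a / c for a, c in zip(coeffs, factors))
    if denominator == 0:
        raise ValueError("at least one factor must be finite")
    lam = residual / denominator
    return [y - (a / c) * lam for y, a, c in zip(winsorized_values, coeffs, factors)]


# Order: profit (k = 0), income, expenses, opening stocks, closing stocks.
coeffs = [-1, 1, -1, 1, -1]
# Hypothetical values after each variable was winsorized separately;
# the relationship no longer holds (200 + 500 - 450 + 50 - 60 = 240).
winsorized = [-200.0, 500.0, 450.0, 50.0, 60.0]

equal = calibrate(winsorized, coeffs, [1.0] * 5)
inf = float("inf")
components_fixed = calibrate(winsorized, coeffs, [1.0, inf, inf, inf, inf])
print(equal, components_fixed)
```

With equal factors the discrepancy is spread evenly over all five variables. With infinite factors on the components, only the derived variable moves, reproducing the first method under the current ABS estimation system.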
4. CONCLUSION


The effectiveness of winsorization at robustifying the GREG estimator against 
unusually large residuals will ultimately depend on the choice of the cutoff values, and 
hence the methods used to estimate the bias parameters and regression parameters. 
This paper has investigated several techniques for fitting robust regression models and 
found that the winsorized estimator performs best under models that are only 
moderately robust. Some conceptual questions have been raised about the link that 
should exist between the GREG model and the model used to estimate regression 
parameters for cutoffs. Current thinking is that the models should involve the same 
auxiliary variables and be fitted at the same level; however, the simulation study found
cases where different models improved performance.


One of the key findings of the study was that diagnostics should be incorporated into 
the regression fitting procedure before using the regression parameters to generate 
cutoff values. In particular, the diagnostics should check that the regression model fits
the data well, that units with large influences are removed from the regression model, and
that the regression model fits the current data to be winsorized.


The winsorization methodology described in this paper works well for point in time 
estimates. Further work to assess the performance of winsorization on GREG 
estimates of movement between time points may prove useful. Movement estimates 
are a key output for many ABS business surveys. It has been noted that large historical 
values can make bias parameters unstable over time and hence impact on movement 
estimates. Further work to determine the best way of dealing with these values and 
the best way of weighting various cycles of historical data to maximise stability of 
parameters over time is recommended. 


The treatment of linear relationships between data items is another area warranting further
investigation. Shortcomings of the current ABS estimation system’s ability to handle
linear relationships between survey variables have been discussed and an alternative 
method presented. The alternative method requires specification of a distance 
function between original and final winsorized values. Investigation is planned into 
the performance of this method and the suitability of the generalised least squares 
distance function. 


The simulation study presented here independently replicated the process of 
selecting historical samples, estimating cutoffs and applying cutoffs to independent 
samples. A large number of replicates were generated to produce estimates of the 
MSE and bias of the winsorized estimator. In practice only a single set of cutoffs is 
generated, based on historical data, and used to winsorize the present sample. This 
introduces a source of variability which is not reflected in current variance estimates 
and which would be difficult to incorporate. The simulation study found cutoffs to be 
more stable when several cycles of historical data were used to estimate parameters
and this is the approach that will be implemented for GREG estimation. Further work 
could involve using the data from the simulation study to quantify the significance of 
the variability introduced through estimating cutoffs. 
ATTACHMENT


Table 1: Percentage reduction in MSE and bias of Winsorized estimates for various methods


            TLS       LAV       SS        LMS       TLS       LAV       SS        LMS
Industry    Method    Method    Method    Method    Method    Method    Method    Method
01          -19.32
03          -49.77
04          -57.41
05          -52.86
06          -50.97
07          -42.57
08          -40.60
10          -25.09
11          -65.66
12          -18.13
13          -66.00
14          -27.38
15          -27.75
16          -39.63
17          -51.01
18          -50.01
19          -44.12
22          -32.63
25          -43.36
26          -39.89
27          -49.20
28          -46.59
30          -19.45
31            0
33          -46.81
35          -37.43
36          -56.52
37          -46.08
38          -32.17
39          -29.47
40          -77.02
41          -67.76
42          -41.23
43            0
44            0
45            0
46            0
47            0
48            0
76          -19.08
Total       -57.93
Table 2: Relative difference between Winsorized estimates using generalised regression weight
and design weight


Industry    0-0.5 %    0.5-1 %    1-3 %    3-5 %    5-10 %    > 10 %
01            83.5      14.5       2.0      0        0         0
03            91.0       5.5       3.5      0        0         0
04            64.0      16.0      17.5      2.5      0         0
05            70.5      18.5      10.5      0.5      0         0
06            44.5      23.5      23.0      8.0      1.0       0
07            74.0      11.5      13.5      1.0      0         0
08            94.0       5.5       0.5      0        0         0
10            88.0      10.0       2.0      0        0         0
11            77.0      14.5       8.0      0.5      0         0
12            20.0      22.0      39.5     14.5      3.5       0.5
13            50.5      19.5      27.5      2.5      0         0
14            70.0      18.0      11.5      0.5      0         0
15            81.5      15.5       3.0      0        0         0
16            77.5      16.0       6.5      0        0         0
17            74.5      16.0       9.0      0.5      0         0
18            34.5      12.0      30.5     10.5      8.5       4.0
19            84.5      12.5       3.0      0        0         0
22            76.5      17.0       6.0      0.5      0         0
25            93.0       6.0       1.0      0        0         0
26            23.5      13.0      33.0     12.0     16.5       2.0
27            30.0      19.5      30.5     11.0      6.0       3.0
28            36.5      20.0      34.5      5.5      2.0       1.5
30             9.5       6.5      35.0     16.5     21.0      11.5
31           100.0       0         0        0        0         0
33            91.0       7.0       2.0      0        0         0
35            20.5      13.0      31.0     18.5     12.0       5.0
36            38.5      24.5      27.5      7.0      2.0       0.5
37            64.5      14.5      17.5      3.5      0         0
38            77.0      16.0       6.0      1.0      0         0
39            67.5      15.5      15.0      2.0      0         0
40            44.0      18.5      29.5      3.5      2.5       2.0
41            45.5      17.5      26.5      8.5      2.0       0
42            35.5      12.0      39.0     10.5      2.5       0.5
43           100.0       0         0        0        0         0
44           100.0       0         0        0        0         0
45           100.0       0         0        0        0         0
46           100.0       0         0        0        0         0
47           100.0       0         0        0        0         0
48           100.0       0         0        0        0         0
76            84.0      12.0       4.0      0        0         0
Total         88.5       9.0       2.5      0        0         0